Task 1¶
Let $f(x_1, x_2) = (x_1 + x_2)^2$ and assume that $x_1, x_2 \sim U[-1,1]$ and $x_1 = x_2$ (full dependency). Calculate the PD, ME and AL profiles for variable $x_1$ in this model.
$$ \begin{align} g^{PD}_{x_1}(z) &= E_{x_2}[f(z, x_2)] = \\ &= z^2 + 2zE_{x_2}[x_2] + E_{x_2}[x_2^2] = \\ &= z^2 + 2z \int_{-1}^{1} \frac{x_2}{1-(-1)} \,dx_2 + \int_{-1}^{1}\frac{x_2^2}{1-(-1)} \,dx_2 = \\ &= z^2 + 2z\frac{x_2^2}{4}\Big|_{-1}^1 + \frac{x_2^3}{6}\Big|_{-1}^1 = z^2 + \frac{z(1 - 1)}{2} + \frac{1 - (-1)}{6} = \\ &= z^2 + \frac{1}{3} \end{align} $$
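As a sanity check on the closed form above, the PD profile can be verified with a quick Monte Carlo simulation (a standalone sketch, independent of the models below; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    return (x1 + x2) ** 2

# For the PD profile, x2 is sampled from its marginal U[-1, 1],
# deliberately ignoring the dependency x1 = x2.
x2 = rng.uniform(-1.0, 1.0, size=1_000_000)

for z in (-1.0, 0.0, 0.5, 1.0):
    mc = f(z, x2).mean()          # Monte Carlo estimate of E_{x2}[f(z, x2)]
    exact = z ** 2 + 1.0 / 3.0    # derived closed form
    print(f"z={z:+.1f}: MC={mc:.4f}, exact={exact:.4f}")
```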
$$ \begin{align} g^{ME}_{x_1}(z) &= E_{x_2}[f(z, x_2)\;|\;x_1=z] = \\ &= E_{x_2}[(z+x_2)^2\;|\; x_1 = z] = \\ & \Big\{ x_1=x_2 \Big\} \\ &= E_{x_2}[(2z)^2] =\\ &= 4z^2 \end{align} $$
$$ \begin{align} g^{AL}_{x_1}(z) &= \int_{-1}^z E\Big[ \frac{\partial f(v,x_2)}{\partial v}\;|\; x_1 = v\Big] \,dv = \\ &= \int_{-1}^z E\Big[ \frac{\partial\;v^2 + x_2^2 + 2vx_2}{\partial v}\;|\; x_1 = v\Big] \,dv = \\ &= \int_{-1}^z E\Big[ 2v + 2x_2\;|\; x_1 = v\Big] \,dv = \\ & \Big\{ x_1=x_2 \Big\} \\ &= \int_{-1}^z E[ 4v] \,dv = \\ &= 4\int_{-1}^z v \,dv = \\ &= 4\frac{v^2}{2} \Big|_{-1}^z =\\ &= 2z^2 - 2(-1)^2 =\\ &= 2(z^2-1) \end{align} $$
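The ME and AL derivations can be checked the same way: under full dependency the conditional mean in the ME profile collapses to $f(z,z)$, and the AL integrand $E[2v + 2x_2 \mid x_1 = v] = 4v$ can be integrated numerically (a standalone sketch; grid size is arbitrary):

```python
import numpy as np

def g_me(z):
    # under full dependency x2 = x1 = z, so the conditional mean is f(z, z)
    return (2 * z) ** 2

def g_al(z, n=10_000):
    # numerically integrate the conditional expected derivative 4v over [-1, z]
    v = np.linspace(-1.0, z, n)
    return np.trapz(4 * v, v)

for z in (-1.0, 0.0, 0.5, 1.0):
    print(f"z={z:+.1f}: ME={g_me(z):.4f}, AL={g_al(z):.4f}, "
          f"AL closed form={2 * (z**2 - 1):.4f}")
```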
Task 2¶
0. For the selected data set, train at least one tree-based ensemble model, e.g. random forest, gbdt, xgboost.¶
For this task I reused my models from homework 1 (with default hyperparameter values, which turned out to make them learn slightly better). Below is the performance of the random forest classifier.
explainer.model_performance()
1. Calculate the predictions for some selected observations.¶
I chose two observations with different TARGET values. Apart from presenting the exact prediction, I also display the predicted probabilities of classes 0 and 1, i.e. how strongly the model leaned towards each class. Results are obtained with the random forest classifier.
show_selected_predictions()
2. Then, calculate the what-if explanations of these predictions using Ceteris Paribus profiles¶
I will display results for both observations on the same figures for better readability. I chose the same observations and model as above.
cp = explainer.predict_profile(new_observation=selected_observations_X)
cp.plot()
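Conceptually, a Ceteris Paribus profile is just the model's prediction for one observation while a single feature is swept over a grid and all other features stay fixed. A minimal standalone sketch of that idea (using synthetic data and column names in place of the wine_quality set, so it is not the dalex implementation):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data standing in for the wine_quality features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"Feature {i}" for i in range(4)])
clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

def cp_profile(model, X, idx, feature, grid):
    """Predictions for observation `idx` with `feature` swept over `grid`."""
    obs = pd.concat([X.iloc[[idx]]] * len(grid), ignore_index=True)
    obs[feature] = grid
    return model.predict_proba(obs)[:, 1]

grid = np.linspace(X_demo["Feature 0"].min(), X_demo["Feature 0"].max(), 25)
profile = cp_profile(clf, X_demo, idx=0, feature="Feature 0", grid=grid)
print(profile.shape)  # one predicted probability per grid point
```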
3. Find two observations in the data set, such that they have different CP profiles. For example, model predictions are increasing with age for one observation and decreasing with age for another one.¶
I carefully chose observations 46 and 147 in point (2) to meet this requirement. We can observe not only different values in these figures, but also dissimilar trends (e.g. the profiles of Feature 0 are symmetric, while Feature 6 has completely different profiles). Although the overall shapes of the figures are similar, they sometimes differ only slightly; for example, the profile of Feature 5 initially dips near 0 (for observation 46), then comes back and aligns with the profile of observation 147.
4. Compare CP, which is a local explanation, with PDP, which is a global explanation¶
pdp = explainer.model_profile()
pdp.plot(geom="profiles", variables=['Feature 0', 'Feature 1', 'Feature 2', 'Feature 3', 'Feature 4', 'Feature 5', 'Feature 6', 'Feature 7', 'Feature 8', 'Feature 9'])
The PDP aligns well with the majority of the CP profiles (apart from a limited number of deviations, such as the example above). I was able to produce the image only in the browser, so it may not be visible here; however, I managed to download it, and the image is in the newplot.png file.
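This agreement is expected, since the PDP for a feature is exactly the average of the CP profiles over the dataset. A standalone sketch of that relationship (again on synthetic data with illustrative column names, not the dalex internals):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"Feature {i}" for i in range(4)])
clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

def pdp(model, X, feature, grid):
    """PDP value at each grid point: the mean prediction over the whole
    dataset with `feature` replaced by the grid value, i.e. the average
    of all individual CP profiles."""
    values = []
    for g in grid:
        X_mod = X.copy()
        X_mod[feature] = g
        values.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(values)

grid = np.linspace(X_demo["Feature 0"].min(), X_demo["Feature 0"].max(), 10)
pdp_values = pdp(clf, X_demo, "Feature 0", grid)
print(pdp_values)
```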
5. Compare PDP between at least two different models.¶
I'm going to compare my two models from HW1: the already tested random forest and logistic regression.
pdp2 = explainer2.model_profile()
pdp.plot(pdp2)
What strikes first is the smoothness of the logistic regression's profiles. Almost all of them look like interpolations of the more detailed versions produced by the random forest. This suggests that although logistic regression was able to assess the contribution of each variable more or less accurately, it could not capture detailed relationships as precisely as the random forest model did.
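The smoothness has a simple explanation: a logistic regression profile in any single feature is a sigmoid of a linear function, so it is always smooth and monotone. A standalone sketch illustrating this (toy data, not the HW1 models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] ** 2 > 0).astype(int)
clf = LogisticRegression().fit(X_toy, y_toy)

# Sweep feature 0 with the others fixed: the profile is sigma(w0 * z + c),
# a monotone sigmoid, which is why logistic-regression profiles look smooth.
grid = np.linspace(-3, 3, 50)
obs = np.tile(X_toy[0], (50, 1))
obs[:, 0] = grid
profile = clf.predict_proba(obs)[:, 1]
diffs = np.diff(profile)
print(np.all(diffs >= 0) or np.all(diffs <= 0))  # the profile is monotone
```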
Appendix¶
Load libraries and dataset:
import dalex as dx
import matplotlib
import platform
import sklearn
from copy import deepcopy
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, recall_score, precision_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# https://github.com/adrianstando/imbalanced-benchmarking-set/blob/main/datasets/wine_quality.csv
dataset = pd.read_csv('wine_quality.csv')
y = dataset["TARGET"].apply(lambda x: 1 if x == 1 else 0)
X_tmp = dataset.drop(axis=1, columns=["TARGET", "Unnamed: 0"])
# rename columns
X = pd.DataFrame()
X_tmp.head()
for i in range(0, 10):
    X[f"Feature {i}"] = X_tmp[str(i)]
X.info()
Train models
model = RandomForestClassifier(random_state=22)
model.fit(X.values, y)
pred = model.predict(X)
model2 = LogisticRegression(random_state=22, max_iter=1000)
model2.fit(X.values, y)
pred2 = model2.predict(X)
def fn(model, df):
    return model.predict_proba(df)[:, 1]
explainer = dx.Explainer(model, X, y, predict_function=fn)
explainer2 = dx.Explainer(model2, X, y, predict_function=fn)
def _construct_selected_observations():
    selected_observations_indices = [46, 147]
    result_X = X.iloc[selected_observations_indices]
    result_y = y.iloc[selected_observations_indices]
    pred_y = model.predict_proba(result_X)
    return result_X, result_y, pred_y
selected_observations_X, selected_observations_y, selected_observations_pred = _construct_selected_observations()
def show_selected_predictions():
    result = deepcopy(selected_observations_X)
    result["TARGET"] = selected_observations_y
    result["pred 0"] = selected_observations_pred[:, 0]
    result["pred 1"] = selected_observations_pred[:, 1]
    return result